
[Core] add client side health-check to detect network failures. #31640

Merged 10 commits into ray-project:master on Jan 13, 2023

Conversation

scv119
Contributor

@scv119 scv119 commented Jan 12, 2023

Why are these changes needed?

Occasionally Ray users have seen ray.get hang when the node executing the task that ray.get is waiting for is preempted and disconnected from the cluster.

While debugging one instance of this hang, we found it was caused by the underlying gRPC channel failing to detect the network failure.

To solve this problem, we need to add some sort of health check at the OS level (TCP keepalive), the RPC level (gRPC), or the application level (Ray). TCP keepalive is not easy to configure through gRPC, and a Ray-level check would require changing a lot of code, so this PR makes the change at the gRPC level.

Also note that in Ray we treat a network failure as a component failure, so we set a relatively loose timeout to reduce false positives.
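For illustration, client-side gRPC keepalive is configured through channel arguments. A minimal Python sketch follows; the option names are standard gRPC channel arguments, but the numeric values are illustrative examples, not the defaults this PR chooses:

```python
# Hedged sketch: enable client-side HTTP/2 keepalive pings so a dead peer
# is detected even when the kernel never receives a FIN/RST. Values are
# examples only, not the timeouts chosen in this PR.

def keepalive_channel_options(time_ms=60_000, timeout_ms=20_000):
    """Return gRPC channel options enabling client-side keepalive pings."""
    return [
        # Ping the server after `time_ms` ms of inactivity.
        ("grpc.keepalive_time_ms", time_ms),
        # Declare the connection dead if a ping goes unacknowledged
        # for `timeout_ms` ms.
        ("grpc.keepalive_timeout_ms", timeout_ms),
        # Send pings even when no RPC is in flight.
        ("grpc.keepalive_permit_without_calls", 1),
    ]

# Usage (requires grpcio; the address is hypothetical):
# channel = grpc.insecure_channel("node:10001",
#                                 options=keepalive_channel_options())
```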

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 marked this pull request as ready for review January 12, 2023 19:20
@cadedaniel
Member

Should we also do this here:

channel_ = BuildChannel(address, port, arguments);

or, if not, why not?

Contributor

@rickyyx rickyyx left a comment


Is there a way to test this?

And what about the Python client's configs?

src/ray/common/ray_config_def.h
@scv119
Contributor Author

scv119 commented Jan 12, 2023

cc @shomilj

Contributor

@rkooo567 rkooo567 left a comment


Requesting changes until Ricky's comments are addressed!

src/ray/common/ray_config_def.h
Contributor

@rkooo567 rkooo567 left a comment


We may need to change the config from the Python side too. I am not sure if we have any gRPC client other than https://github.com/ray-project/ray/blob/master/python/ray/_private/gcs_pubsub.py. Maybe we should aggregate all gRPC client usage into a single file so the global config applies to Python as well?

src/ray/common/ray_config_def.h
@rkooo567 rkooo567 added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send back to the reviewer) on Jan 13, 2023
src/ray/common/grpc_util.h (outdated)
@scv119 scv119 removed the @author-action-required label on Jan 13, 2023
@rkooo567 rkooo567 added the @author-action-required label on Jan 13, 2023
@scv119
Contributor Author

scv119 commented Jan 13, 2023

Hmm, the challenging part is simulating the failure mode where the node is terminated without sending a FIN on the TCP connection.

@scv119 scv119 merged commit 0ca11dc into ray-project:master Jan 13, 2023
@scv119
Contributor Author

scv119 commented Jan 13, 2023

I tried both rebooting and preempting spot instances while a job was running; in both cases Ray was able to detect that the node failed.
However, we are not 100% sure we reproduced the exact problem our customer encountered.

@scv119
Contributor Author

scv119 commented Jan 13, 2023

#24969
